AUTO_P(5)

NAME
     AUTO_P - Automatic Parallelization

TOPIC
     This man page discusses automatic parallelization and how to
     achieve it with the Silicon Graphics MIPSpro Automatic
     Parallelization Option.  The following topics are covered:

          Automatic Parallelization and the MIPSpro Compilers

          Using the MIPSpro Automatic Parallelization Option

Automatic Parallelization and the MIPSpro Compilers

Parallelization is the process of analyzing sequential programs for
parallelism so that they may be restructured to run efficiently on
multiprocessor systems.  The goal is to minimize the overall computation
time by distributing the computational work load among the available
processors.

Parallelization can be automatic or manual.  During automatic
parallelization, the MIPSpro Automatic Parallelization Option, hereafter
called the auto-parallelizer, analyzes and restructures the program with
little or no intervention by the developer.  The auto-parallelizer can
automatically generate code that splits the processing of loops among
multiple processors.  The alternative is manual parallelization, in
which the developer performs the parallelization using pragmas and other
programming techniques.  Manual parallelization is discussed in the
mp(3f) and mp(3c) man pages.

Automatic parallelization begins with the determination of data
dependences among the variables and arrays in loops.  A data dependence
can prevent a loop from being safely run in parallel because the final
outcome of the computation may vary depending on the order in which the
various processors access the variables and arrays.  Data dependence and
other obstacles to parallelization are discussed in more detail below.

Once data dependences are resolved, a number of automatic
parallelization strategies can be employed.  They can consist of the
following (scalar expansion is sketched after this list):

     Loop interchange of nested loops

     Scalar expansion

     Loop distribution

     Automatic synchronization of DOACROSS loops

     Intraprocedural array privatization
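To illustrate one of these strategies, the sketch below shows the idea
behind scalar expansion: a temporary scalar that every iteration reuses
is replaced by a per-iteration array element, removing the conflict
between iterations.  The code is illustrative, not compiler output; the
arrays a, b, c and the expanded temporary tx are hypothetical.

C     Before: every iteration reuses the shared scalar t, so the
C     iterations conflict on it.
      do i = 1, n
         t = a(i) + b(i)
         c(i) = t*t
      end do

C     After scalar expansion: each iteration writes its own element of
C     the temporary array tx, so the iterations are independent.
      do i = 1, n
         tx(i) = a(i) + b(i)
         c(i) = tx(i)*tx(i)
      end do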
The 7.2 release of the MIPSpro compilers marks a major revision of the
auto-parallelizer.  The new release incorporates automatic
parallelization into the other optimizations performed by the MIPSpro
compilers.  Previous versions relied on preprocessors to provide
source-to-source conversions prior to compilation.  This change provides
several benefits to developers:

     Automatic parallelization is integrated with optimizations for
     single processors

     A set of options and pragmas consistent with the rest of the
     MIPSpro compilers

     Support for C++

     Better run-time and compile-time performance

The MIPSpro Automatic Parallelization Option

Developers exploit parallelism in programs to provide better performance
on multiprocessor systems.  You do not need a multiprocessor system to
use the auto-parallelizer.  Although there is a slight performance loss
when a single-processor system runs multiprocessed code, you can use the
auto-parallelizer on any Silicon Graphics system to create and debug a
program.

The auto-parallelizer is an optional software product that is used as an
extension to the following compilers:

     MIPSpro Fortran 77

     MIPSpro Fortran 90

     MIPSpro C

     MIPSpro C++

It is controlled by flags inserted in the command lines that invoke the
supported compilers.

Using the MIPSpro Automatic Parallelizer

This section describes how to use the auto-parallelizer when you compile
and run programs with the MIPSpro compilers.

Using the MIPSpro Compilers to Parallelize Programs

You invoke the auto-parallelizer by using the -pfa or -pca flags on the
command lines for the MIPSpro compilers.  The syntax for compiling
programs with the auto-parallelizer is as follows.

For Fortran 77 and Fortran 90, use -pfa:

     %f77 options -pfa [{ list | keep }] [ -mplist ] filename

     %f90 options -pfa [{ list | keep }] [ -mplist ] filename

For C and C++, use -pca:

     %cc options -pca [{ list | keep }] [ -mplist ] filename

     %CC options -pca [{ list | keep }] [ -mplist ] filename

where options are MIPSpro compiler command-line options.  For details on
the other options, see the documentation for your MIPSpro compiler.

-pfa and -pca
     Invoke the auto-parallelizer and enable any multiprocessing
     directives.

list
     Produce an annotated listing of the parts of the program that can
     (and cannot) run in parallel on multiple processors.  The listing
     file has the suffix .l.

keep
     Generate the listing file (.l), the transformed equivalent program
     (.m), and an output file for use with WorkShop Pro MPF (.anl).

-mplist
     Generate a transformed equivalent program in a .w2f.f file for
     Fortran 77 or a .w2c.c file for C.

filename
     The name of the file containing the source code.

To use the auto-parallelizer with Fortran programs, add the -pfa flag to
both the compile and link lines.  For C or C++, add the -pca flag.  If
you link separately, you must also add -mp to the link line.
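For example, a two-file Fortran 77 program built in separate compile and
link steps might be handled as follows; the file names are illustrative:

     %f77 -O3 -pfa -c main.f
     %f77 -O3 -pfa -c sub.f
     %f77 -pfa -mp main.o sub.o -o prog

Both compile lines carry -pfa, and the separate link step adds -mp as
described above.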
Previous versions of the Power compilers had a large set of flags to
control optimization.  The 7.2 version uses the same set of options as
the rest of the MIPSpro compilers.  So, for example, while in the older
Power compilers the option -pfa,-r=0 turned off roundoff-changing
transformations in the pfa preprocessor, in the new compiler
-OPT:roundoff=0 turns off roundoff-changing transformations in all
phases of the compiler.

The -pfa list option generates a .l file.  The .l file lists the loops
in your code, indicating which were parallelized and which were not.  If
any were not parallelized, it explains why not.  The -pfa keep option
generates a .l file, a .m file, and a .anl file that is used by the
WorkShop Pro MPF tool.  The .m file is similar to the .w2f.f or .w2c.c
file except that it is annotated with some information used by the
WorkShop Pro MPF tool.

The -mplist option will, in addition to compiling your program, generate
a .w2f.f file (for Fortran 77; a .w2c.c file for C) that represents the
program after the automatic parallelization phase.  These programs
should be readable and in most cases should be valid code suitable for
recompilation.  The -mplist option can be used to see what portions of
your code were parallelized.

For Fortran 90 and C++, automatic parallelization happens after the
source program has been converted into an internal representation.  It
is not possible to regenerate Fortran 90 or C++ after parallelization.

Example: Analyzing a .l File

     %cat foo.f
           subroutine sub(arr,n)
           real*8 arr(n)
           do i=1,n
              arr(i) = arr(i) + arr(i-1)
           end do
           do i=1,n
              arr(i) = arr(i) + 7.0
              call foo(a)
           end do
           do i=1,n
              arr(i) = arr(i) + 7.0
           end do
           end
     %f77 -O3 -n32 -mips4 -pfa list foo.f -c

Here is the associated .l file:

     Parallelization Log for Subprogram sub_
     3: Not Parallel
          Array dependence from arr on line 4 to arr on line 4.
     6: Not Parallel
          Call foo on line 8.
     10: PARALLEL (Auto) __mpdo_sub_1

Example: Analyzing a .w2f.f File

     %cat test.f
           subroutine trivial(a)
           real a(10000)
           do i=1,10000
              a(i) = 0.0
           end do
           end
     %f77 -O3 -n32 -mips4 -c -pfa -mplist test.f

We get both an object file, test.o, and a test.w2f.f file that contains
the following code:

     SUBROUTINE trivial(a)
       IMPLICIT NONE
       REAL*4 a(10000_8)
       INTEGER*4 i

C$DOACROSS local(i), shared(a)
       DO i = 1, 10000, 1
         a(i) = 0.0
       END DO
       RETURN
     END ! trivial

Running Your Program

Invoke your program as if it were a sequential program.  The same binary
can execute using different numbers of processors.  By default, the
runtime will select how many processors to use based on the number of
processors in the machine.  The developer can use the environment
variable NUM_THREADS to change the default to an explicit number of
processors.  In addition, the developer can have the number of
processors vary dynamically from loop to loop based on system load by
setting the environment variable MP_SUGNUMTHD.  Refer to the mp(3f) and
mp(3c) man pages for more details.
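For example, under csh you might pin the processor count before running
your program; the variable names are those given above, and the program
name and value are illustrative:

     %setenv NUM_THREADS 4
     %prog

Defining MP_SUGNUMTHD instead lets the runtime vary the count with
system load; see mp(3f) for the exact semantics of these variables.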
Simply passing code through the auto-parallelizer does not always
produce all of the increased performance available.  The following
sections discuss strategies for making effective use of the product when
the auto-parallelizer is not able to fully parallelize an application.

Analyzing the Automatic Parallelizer's Results

Running a program through the auto-parallelizer often results in
excellent parallel speedups, but there are cases that cannot be
parallelized well automatically.  By understanding the listing files,
you can sometimes identify small problems that prevent a loop from
running safely in parallel.  With a relatively small amount of work, you
can remove these data dependences and dramatically improve the program's
performance.

Hint: When trying to find loops to run in parallel, focus your efforts
on the areas of the code that use the bulk of the run time.  Spending
time trying to run a routine in parallel that uses only one percent of
the run time of the program cannot significantly improve the overall
performance of your program.  To determine where your code spends its
time, take an execution profile of the program using the SpeedShop
performance tools.
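For instance, a program-counter sampling profile can be gathered with
the SpeedShop ssrun(1) and prof(1) commands.  The program name is
illustrative, and the experiment file suffix encodes the process ID:

     %ssrun -pcsamp prog
     %prof prog.pcsamp.m12345

The prof report shows which routines consume the most time and thus
where parallelization effort is likely to pay off.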
The auto-parallelizer provides several mechanisms to analyze what it
did.  For Fortran 77 and C programs, the -mplist option generates the
code after parallelization.  Manual parallelism directives are inserted
on loops that have been automatically parallelized.  For details about
these directives, refer to Chapters 5-7, "Fortran Enhancements for
Multiprocessors," of the MIPSpro Fortran 77 Programmer's Guide, or
Chapter 11, "Multiprocessing C/C++ Compiler Directives," of the C
Language Reference Manual.

The output code in the .w2f.f or .w2c.c file should be readable and
understandable.  You can use it as a tool to gain insight into what the
auto-parallelizer did, and then use that insight to make changes to the
original source program.  Note that the auto-parallelizer is not a
source-to-source preprocessor, but is instead an internal phase of the
MIPSpro compilers.  With a preprocessor system, a post-parallelization
file would always be generated and fed into the regular compiler.  This
is not the case with the auto-parallelizer.  Therefore, compiling a
.w2f.f or .w2c.c file through a MIPSpro compiler will not generate code
identical to compiling the original source through the MIPSpro
auto-parallelizer, but the two will often be almost the same.

The auto-parallelizer also provides a listing mechanism via the -pfa or
-pca keep option or the -pfa or -pca list option.  This causes the
compiler to generate a .l file.  The .l file lists the original loops in
the program along with messages telling whether or not the loops were
parallelized.  For loops that were not parallelized, an explanation is
given.

Parallelization Failures With the Automatic Parallelizer

This section discusses mistakes you can avoid and actions you can take
to enhance the performance of the auto-parallelizer.  The
auto-parallelizer is not always able to parallelize programs
effectively.  This can be true for a number of reasons, some of which
you can address.  There are three broad categories of parallelization
failure:

     The auto-parallelizer does not detect that a loop is safe to
     parallelize

     The auto-parallelizer chooses the wrong nested loop to make
     parallel

     The auto-parallelizer parallelizes a loop that would run more
     efficiently sequentially

Failure to Recognize Safe Loops

We want the auto-parallelizer to recognize every loop that is safe to
parallelize.  A loop is not safe if there is a data dependence, so the
auto-parallelizer analyzes each loop in a sequential program to try to
prove it is safe.  If it cannot prove a loop is safe, it does not do the
parallelization.  A loop that contains any of the constructs described
in this section may not be proved safe.  However, in many instances the
loop can be proved safe after minor changes.  You should review your
program's .l file to see if any of these constructs appear in your code.
Usually the failure to recognize a loop as safe is related to one or
more of the following practices:

     Function calls in loops

     GO TO statements in loops

     Complicated array subscripts

     Conditionally assigned temporary variables in loops

     Unanalyzable pointer usage in C/C++

Function Calls in Loops

By default, the auto-parallelizer does not parallelize a loop that
contains a function call because the function in one iteration may
modify or depend on data in other iterations of the loop.  However, a
couple of tools can help with this problem.

Interprocedural analysis, specified by the -IPA command-line option, can
provide the auto-parallelizer with enough additional information to
parallelize some loops that contain function calls.  For more
information on interprocedural analysis, see the MIPSpro Compiling and
Performance Tuning Guide.

The C*$* ASSERT CONCURRENT CALL Fortran assertion, discussed below,
allows you to tell the auto-parallelizer to ignore function calls when
analyzing the specified loops.

GO TO Statements in Loops

The use of GO TO statements in loops can cause two problems:

     Early exits from loops.  It is not possible to parallelize loops
     with early exits, either automatically or manually (see the sketch
     after this list).

     Unstructured control flows.  The auto-parallelizer attempts to
     convert unstructured control flows in loops into structured
     constructs.  If the auto-parallelizer cannot restructure these
     control flows, your only alternatives are manual parallelization or
     restructuring the code.
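As a sketch of the early-exit problem, consider a search loop that stops
as soon as it finds a match.  The exit makes the trip count unknowable
in advance, so the loop cannot be parallelized; if the computation can
be recast with a fixed trip count (here, counting all matches rather
than stopping at the first), the obstacle disappears.  The variables are
hypothetical:

C     Early exit: not parallelizable in this form.
      do i = 1, n
         if (a(i) .eq. key) go to 100
      end do
 100  continue

C     Recast with a fixed trip count (a reduction), the loop becomes
C     a candidate for parallelization.
      nmatch = 0
      do i = 1, n
         if (a(i) .eq. key) nmatch = nmatch + 1
      end do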
Complicated Array Subscripts

There are several cases where array subscripts are too complicated to
permit parallelization.

Indirect Array References

The auto-parallelizer is not able to analyze indirect array references.
Consider the following Fortran example:

     do i = 1,n
        a(b(i)) = ...
     end do

This loop cannot be run safely in parallel if the indirect reference
b(i) takes the same value for different iterations of i.  If every
element of array b is unique, the loop can safely be made parallel.  In
such cases, use either manual methods or the C*$* ASSERT PERMUTATION
Fortran assertion, discussed below, to achieve parallelism.

Unanalyzable Subscripts

The auto-parallelizer cannot parallelize loops containing arrays with
unanalyzable subscripts.  In the following case, the auto-parallelizer
is not able to analyze the division in the array subscript and cannot
reorder the loop:

     do i = l,u,2
        a(i/2) = ...
     end do

Hidden Knowledge

In the following example there may be hidden knowledge about the
relationship between the variables m and n:

     do i = 1,n
        a(i) = a(i+m)
     end do

The loop can be run in parallel if m > n, because the two ranges of
array references will not overlap.  However, because the
auto-parallelizer does not know the values of the variables, it cannot
make the loop parallel.

Conditionally Assigned Temporary Variables in Loops

When parallelizing a loop, the auto-parallelizer often localizes
(privatizes) temporary scalar and array variables.  Consider the
following example:

     do i = 1,n
        do j = 1,n
           tmp(j) = ...
        end do
        do j = 1,n
           a(j,i) = a(j,i) + tmp(j)
        end do
     end do

The array tmp is used for local scratch space.  To successfully
parallelize the outer (i) loop, each processor must be given a distinct,
private tmp array.  In this example, the auto-parallelizer is able to
localize tmp and parallelize the loop.

The auto-parallelizer runs into trouble when a conditionally assigned
temporary variable might be used outside of the loop, as in the
following example:

     subroutine s1(a,b)
     common t
     ...
     do i = 1,n
        if (b(i)) then
           t = ...
           a(i) = a(i) + t
        end if
     end do
     call s2()

If the loop were run in parallel, a problem would arise if the value of
t were used inside subroutine s2(): which processor's private copy of t
should s2() use?  If t were not conditionally assigned, the answer would
be the processor that executed iteration n.  But t is conditionally
assigned, and the auto-parallelizer cannot determine which copy to use.

The loop is inherently parallel if the conditionally assigned variable t
is localized.  If the value of t is not used outside the loop, you
should replace t with a local variable.  Unless t is a local variable,
the auto-parallelizer must assume that s2() might use it.
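A minimal sketch of that repair, assuming s2() does not actually read t:
drop t from the common block and declare it locally, so the
auto-parallelizer is free to privatize it.  The body of the assignment
to t is illustrative:

      subroutine s1(a, b, n)
      integer n, i
      real a(n)
      logical b(n)
C     t is local to s1 (not in a common block), so it can be
C     privatized and the i loop becomes a candidate for
C     parallelization.
      real t
      do i = 1, n
         if (b(i)) then
            t = 2.0 * a(i)
            a(i) = a(i) + t
         end if
      end do
      call s2()
      end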
Unanalyzable Pointer Usage in C/C++

The C and C++ languages have features that make them more difficult than
Fortran to parallelize automatically.  Many of these features are
related to the use of pointers.  The following practices involving
pointers interfere with the auto-parallelizer's effectiveness.

Arbitrary Pointer Dereferences

The auto-parallelizer does not analyze arbitrary pointer dereferences.
The only pointers it analyzes are array references and pointer
dereferences that can be converted into array references.  The
auto-parallelizer could subdivide the trees formed by dereferencing
arbitrary pointers and run the parts in parallel; however, it cannot
determine whether such a tree is really a directed graph with an unsafe
multiple reference.  Therefore the parallelization is not done.

Arrays of Arrays

Multidimensional arrays are sometimes implemented as arrays of arrays.
Consider this example:

     double **p;

     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
             p[i][j] = ...

If p is a true multi-dimensional array, the outer loop can be run safely
in parallel.  If two of the array pointers, p[2] and p[3] for example,
reference the same array, the loop must not be run in parallel.
Although this duplicate reference is unlikely, the auto-parallelizer
cannot prove it doesn't exist.  You can avoid this problem by always
using true arrays.  To parallelize the code fragment above, rewrite it
as follows:

     double p[n][n];

     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
             p[i][j] = ...

Note: Although ANSI C does not allow variable-sized multi-dimensional
arrays, there is a proposal to allow them in the next standard.  The
MIPSpro 7.2 auto-parallelizer already implements this proposal.

Loops Bounded by Pointer Comparisons

The auto-parallelizer reorders only those loops in which the number of
iterations can be exactly determined.  In Fortran programs this is
rarely a problem, but in C and C++ subtle issues relating to overflow
and unsigned arithmetic can come into play.  One consequence is that
loops should not be bounded by pointer comparisons such as:

     int *pl, *pu;

     for (int *p = pl; p != pu; p++)

This loop cannot be made parallel, and compiling it will result in a .l
file entry stating that the bound cannot be standardized.  To avoid this
result, restructure the loop to be of the form:

     int lb, ub;

     for (int i = lb; i <= ub; i++)

Aliased Parameter Information

Perhaps the most frequent impediment to parallelizing C and C++ is
aliasing among parameters.  Although Fortran guarantees that multiple
parameters to a subroutine are not aliased to each other, C and C++ do
not.  Consider the following example:

     void sub(double *a, double *b, int n)
     {
         for (int i = 0; i < n; i++)
             a[i] = b[i];
     }

This loop can be parallelized only if arrays a and b do not overlap.
With the option -OPT:alias=restrict, you can assure the
auto-parallelizer that the arrays do not overlap.  This assurance
permits the auto-parallelizer to proceed with the parallelization.  See
the MIPSpro Compiling and Performance Tuning Guide for details about
this option.
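For example, if you know the arguments to sub() never overlap, compiling
with the alias assertion might look like this; the file name is
illustrative:

     %cc -O3 -pca -OPT:alias=restrict -c sub.c

Because the option asserts non-overlap for the whole compilation, use it
only on files for which you are sure the assumption holds.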
Incorrectly Parallelized Nested Loops

The auto-parallelizer parallelizes a loop by distributing its iterations
among the available processors.  Because the resulting performance is
usually better, the auto-parallelizer tries to parallelize the outermost
loop.  If it cannot do so, probably for one of the reasons mentioned in
the previous section, it tries to interchange the outermost loop with an
inner one that it can parallelize.

Example: Nested Loops

     do i = 1,n
        do j = 1,n
           ...
        end do
     end do

Even when most of your program is parallelized, it is possible that the
wrong loop is parallelized.  Given a nest of loops, the
auto-parallelizer will parallelize only one of the loops in the nest.
In general, it is better to parallelize outer loops rather than inner
ones.  The auto-parallelizer tries either to parallelize the outer loop
or to interchange the parallel loop so that it is outermost, but
sometimes this is not possible.  For any of the reasons mentioned in the
previous section, the auto-parallelizer might be able to parallelize an
inner loop but not the outer one.  Even if this results in most of your
code being parallelized, it might be advantageous to modify your code so
that the outer loop is parallelized.

It is better to parallelize loops that do not have very small trip
counts.  Consider the following example:

     do i = 1,m
        do j = 1,n

The auto-parallelizer may decide to parallelize the i loop, but if m is
very small, it would be better to interchange the j loop to be outermost
and then parallelize it.  The auto-parallelizer might not have any way
to know that m is small.  In such cases, you can either use the C*$*
ASSERT DO PREFER directives discussed in the next section to tell the
auto-parallelizer that it is better to parallelize the j loop, or you
can use manual parallelism directives.

Because of memory hierarchies, performance can be improved if the same
processors access the same data in all parallel loop nests.  Consider
the following two examples.

Example: Inefficient Loop

     do i = 1,n
        ...a(i)...
     end do
     do i = n,1,-1
        ...a(i)...
     end do

Assume that there are p processors.  In the first loop, the first
processor will access the first n/p elements of a, the second processor
the next n/p elements, and so on.  In the second loop, the first
processor will access the last n/p elements of a.  Assuming n is not too
large, those elements will be in the cache of a different processor.
Accessing data that is in some other processor's cache can be very
expensive.  This example might run much more efficiently if we reverse
the direction of one of the loops.

Example: Efficient Loop

     do i = 1,n
        do j = 1,n
           a(i,j) = b(j,i) + ...
        end do
     end do
     do i = 1,n
        do j = 1,n
           b(i,j) = a(j,i) + ...
        end do
     end do

In this second example, the auto-parallelizer might choose to
parallelize the outer loop in both nests.  This means that in the first
loop the first processor accesses the first n/p rows of a and the first
n/p columns of b, while in the second loop the first processor accesses
the first n/p columns of a and the first n/p rows of b.  This example
will run much more efficiently if we parallelize the i loop in one nest
and the j loop in the other.  You can add the prefer directives
described in the next section to solve this problem, as shown in the
sketch that follows.
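A sketch of that fix, using the C*$* ASSERT DO PREFER (CONCURRENT)
assertion described in the next section: prefer the i loop in the first
nest and the j loop in the second, so each processor touches the same
rows of a and columns of b in both nests.  The loop bodies are
illustrative stand-ins for the ones above:

C*$* ASSERT DO PREFER (CONCURRENT)
      do i = 1,n
         do j = 1,n
            a(i,j) = b(j,i) + 1.0
         end do
      end do

      do i = 1,n
C*$* ASSERT DO PREFER (CONCURRENT)
         do j = 1,n
            b(i,j) = a(j,i) + 1.0
         end do
      end do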
Unnecessarily Parallelized Loops

The auto-parallelizer may parallelize loops that would run better
sequentially.  While this is usually not a disaster, it can cause
unnecessary overhead.  There is a certain overhead to running loops in
parallel.  If, for example, a loop has a small number of iterations, it
is faster to execute the loop sequentially.

When bounds are unknown (and even sometimes when they are known), the
auto-parallelizer parallelizes loops conditionally.  In other words,
code is generated for both a parallel and a sequential version of the
loop.  The parallel version is executed only when the auto-parallelizer
estimates that there is sufficient work for parallel execution to be
worthwhile.  This estimate depends on the iteration count, what code is
inside the loop body, how many processors are available, and the
auto-parallelizer's estimate of the overhead cost to invoke a parallel
loop.  You can control the compiler's estimate of the invocation
overhead using the option -LNO:parallel_overhead=n.  The default value
for n varies on different systems, but typical values are in the low
thousands.

By generating two versions of the loop, we avoid going parallel in small
trip count cases, but versioning does incur an overhead to do the
dynamic check.  You can use the DO PREFER assertions to ensure that a
loop goes parallel or sequential without incurring a run-time test.

Nested parallelism is not supported.  Consider the following case:

     subroutine caller
        do i
           call sub
        end do
     end

     subroutine sub
        ...
        do i
           ...
        end do
     end

Suppose that the first loop is parallelized.  It is not possible to
execute the loop inside sub in parallel whenever sub is called by
caller.  Thus the auto-parallelizer must generate a test for every
parallel loop that checks whether the loop is being invoked from another
parallel loop or region.  While this check is not very expensive, in
some cases it can add to overhead.  If you know that sub is always
called from caller, you can use the prefer directives to force the loop
in sub to go sequential, as sketched below.
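A sketch of that advice, assuming sub is only ever called from the
parallelized loop in caller; the assertion syntax is described in the
next section, and the loop body and arguments are illustrative:

      subroutine sub(a, n)
      integer n, i
      real a(n)
C     Forcing this loop sequential avoids the run-time check for
C     nested parallelism on every call from caller's parallel loop.
C*$* ASSERT DO PREFER (SERIAL)
      do i = 1, n
         a(i) = a(i) + 1.0
      end do
      end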
Assisting the Silicon Graphics Automatic Parallelizer

This section discusses actions you can take to enhance the performance
of the auto-parallelizer.

There are circumstances that interfere with the auto-parallelizer's
ability to optimize programs.  As shown in Parallelization Failures With
the Automatic Parallelizer, problems are sometimes caused by coding
practices.  Other times, the auto-parallelizer does not have enough
information to make good parallelization decisions.  You can pursue
three strategies to attack these problems and achieve better results
with the auto-parallelizer.

The first approach is to modify your code to avoid coding practices that
the auto-parallelizer cannot analyze well.

The second strategy is to assist the auto-parallelizer with the manual
parallelization directives described in the MIPSpro Compiling and
Performance Tuning Guide.  The auto-parallelizer is designed to
recognize and coexist with manual parallelism.  You can use manual
directives with some loop nests, while leaving others to the
auto-parallelizer.  This approach has both positive and negative
aspects.

On the positive side, the manual parallelism directives are well defined
and deterministic.  If you use a manual directive, the specified loop
will run in parallel.

Note: This last statement assumes that the trip count is greater than
one and that the specified loop is not nested in another parallel loop.

On the negative side, you must carefully analyze the code to determine
that parallelism is safe.  Also, you must mark all variables that need
to be localized.

The third alternative is to use the automatic parallelization directives
and assertions to give the auto-parallelizer more information about your
code.  The automatic directives and assertions are described in
Directives and Assertions for Automatic Parallelization.  Like the
manual directives, they have positive and negative features.

On the positive side, automatic directives and assertions are easier to
use, and they allow you to express the information you know without
having to be certain that all the conditions for parallelization are
met.

On the negative side, they are hints and thus do not impose parallelism.
In addition, as with the manual directives, you must ensure that you are
using them legally.  Because they require less information than the
manual directives, automatic directives and assertions can have subtle
meanings.

Directives and Assertions for Automatic Parallelization

Directives enable, disable, or modify features of the auto-parallelizer.
Assertions assist the auto-parallelizer by providing it with additional
information about the source program.  The automatic directives and
assertions do not impose parallelism; they give hints and assertions to
the auto-parallelizer to assist it in parallelizing the right loops.  To
invoke a directive or assertion, include it in the input file.  Listed
below are the Fortran directives and assertions for the
auto-parallelizer.

C*$* NO CONCURRENTIZE
     Do not parallelize a subroutine or file.

C*$* CONCURRENTIZE
     Not used.  (See below.)

C*$* ASSERT DO (CONCURRENT)
     Ignore perceived dependences between two references to the same
     array when parallelizing.

C*$* ASSERT DO (SERIAL)
     Do not parallelize the following loop.

C*$* ASSERT CONCURRENT CALL
     Ignore subroutine calls when parallelizing.

C*$* ASSERT PERMUTATION (array_name)
     Array array_name is a permutation array.

C*$* ASSERT DO PREFER (CONCURRENT)
     Parallelize the following loop if it is safe.

C*$* ASSERT DO PREFER (SERIAL)
     Do not parallelize the following loop.

Note: The general compiler option -LNO:ignore_pragmas causes the
auto-parallelizer to ignore all of these directives and assertions.

C*$* NO CONCURRENTIZE

The C*$* NO CONCURRENTIZE directive prevents parallelization.  Its
effect depends on where it is placed.  When placed inside a subroutine,
the directive prevents the parallelization of that subroutine.  In the
following example, SUB1() is not parallelized:

     SUBROUTINE SUB1
C*$* NO CONCURRENTIZE
        ...
     END

When placed outside of a subroutine, C*$* NO CONCURRENTIZE prevents the
parallelization of all the subroutines in the file.  The subroutines
SUB2() and SUB3() are not parallelized in the next example:

     SUBROUTINE SUB2
        ...
     END
C*$* NO CONCURRENTIZE
     SUBROUTINE SUB3
        ...
     END
The C*$* NO CONCURRENTIZE directive is valid only when the -pfa or -pca
command-line option is used.

C*$* CONCURRENTIZE

The C*$* CONCURRENTIZE directive exists only to maintain backward
compatibility, and its use is discouraged.  Using the -pfa or -pca
option replaces this directive.

C*$* ASSERT DO (CONCURRENT)

C*$* ASSERT DO (CONCURRENT) says that when analyzing the loop
immediately following this assertion, the auto-parallelizer should
ignore any perceived dependences between two references to the same
array.  The following example is a correct use of the assertion when
M > N:

C*$* ASSERT DO (CONCURRENT)
     DO I = 1, N
        A(I) = A(I+M)
     END DO

This assertion is usually used to help the auto-parallelizer with loops
that have indirect array references.  There are other facts to be aware
of when using this assertion:

     If multiple loops in a nest can be parallelized, C*$* ASSERT DO
     (CONCURRENT) causes the auto-parallelizer to prefer the loop
     immediately following the assertion.

     The assertion does not affect how the auto-parallelizer analyzes
     CALL statements or dependences between two potentially aliased
     pointers.

Note: If there are real dependences between array references, C*$*
ASSERT DO (CONCURRENT) may cause the auto-parallelizer to generate
incorrect code.

C*$* ASSERT DO (SERIAL)

C*$* ASSERT DO (SERIAL) instructs the auto-parallelizer not to
parallelize the loop following the assertion.
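For example, a hypothetical loop that you know runs faster sequentially
can be kept serial like this; the loop body is illustrative:

C*$* ASSERT DO (SERIAL)
     DO I = 1, N
        A(I) = A(I) + B(I)
     END DO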
C*$* ASSERT CONCURRENT CALL

The C*$* ASSERT CONCURRENT CALL assertion tells the auto-parallelizer to
ignore subroutine calls contained in a loop when deciding whether that
loop is parallel.  The assertion applies to the loop that immediately
follows it and to all loops nested inside that loop.  The
auto-parallelizer ignores subroutine FRED() when it analyzes the
following loop:

C*$* ASSERT CONCURRENT CALL
     DO I = 1, N
        CALL FRED
        ...
     END DO

     SUBROUTINE FRED
        ...
     END

To prevent incorrect parallelization, you must make sure the following
conditions are met when using C*$* ASSERT CONCURRENT CALL:

     A subroutine cannot read from a location inside the loop that is
     written to during another iteration.  This rule does not apply to a
     location that is a local variable declared inside the subroutine.

     A subroutine cannot write to a location inside the loop that is
     read from during another iteration.  This rule does not apply to a
     location that is a local variable declared inside the subroutine.

The following code shows an illegal use of the assertion.  Subroutine
FRED() writes to the variable T, which is also read from by WILMA()
during other iterations:

C*$* ASSERT CONCURRENT CALL
     DO I = 1,M
        CALL FRED(B, I, T)
        CALL WILMA(A, I, T)
     END DO

     SUBROUTINE FRED(B, I, T)
        REAL B(*)
        T = B(I)
     END

     SUBROUTINE WILMA(A, I, T)
        REAL A(*)
        A(I) = T
     END

By localizing the variable T, you could manually parallelize the above
example safely.  But the auto-parallelizer does not know to localize T,
and it illegally parallelizes the loop because of the assertion.

C*$* ASSERT PERMUTATION (array_name)

C*$* ASSERT PERMUTATION tells the auto-parallelizer that array_name is a
permutation array: every element of the array has a distinct value.
Array B is asserted to be a permutation array in this example:

C*$* ASSERT PERMUTATION (B)
     DO I = 1, N
        A(B(I)) = ...
     END DO

As this example shows, you can use the assertion to parallelize loops
that use arrays for indirect addressing.  Without the assertion, the
auto-parallelizer is not able to determine that the array elements used
as indexes are distinct.

Note: The assertion does not require the permutation array to be dense.

C*$* ASSERT DO PREFER (CONCURRENT)

C*$* ASSERT DO PREFER (CONCURRENT) says that the auto-parallelizer
should parallelize the loop immediately following the assertion, if it
is safe to do so.  The following code encourages the auto-parallelizer
to run the I loop in parallel:

C*$* ASSERT DO PREFER (CONCURRENT)
     DO I = 1, M
        DO J = 1, N
           A(I,J) = B(I,J)
        END DO
        ...
     END DO

When dealing with nested loops, follow these guidelines:

     If the loop specified by this assertion is safe to parallelize, the
     auto-parallelizer chooses it to parallelize, even if other loops in
     the nest are safe.

     If the specified loop is not safe, the auto-parallelizer chooses
     another loop that is safe, usually the outermost.

     This assertion can be applied to more than one loop in a nest.  In
     that case, the auto-parallelizer uses its heuristics to choose one
     of the specified loops.

Note: C*$* ASSERT DO PREFER (CONCURRENT) is always safe to use.  The
auto-parallelizer will not illegally parallelize a loop because of this
assertion.

C*$* ASSERT DO PREFER (SERIAL)

The C*$* ASSERT DO PREFER (SERIAL) assertion requests that the
auto-parallelizer not parallelize the loop that immediately follows.  In
the following case, the assertion requests that the J loop be run
serially:

     DO I = 1, M
C*$* ASSERT DO PREFER (SERIAL)
        DO J = 1, N
           A(I,J) = B(I,J)
        END DO
        ...
     END DO

The assertion applies only to the loop directly after it; here the I
loop can still be parallelized.
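Since the assertion binds to a single loop, keeping an entire nest
serial presumably requires repeating it before each loop in the nest; a
sketch under that assumption:

C*$* ASSERT DO PREFER (SERIAL)
     DO I = 1, M
C*$* ASSERT DO PREFER (SERIAL)
        DO J = 1, N
           A(I,J) = B(I,J)
        END DO
     END DO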